The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification
Internal identifier: 000F97 (Main/Exploration); previous: 000F96; next: 000F98
Authors: Mayo Murata [Japan]; P. Busagala [Japan]; Wataru Ohyama [Japan]; Tetsushi Wakabayashi [Japan]; Fumitaka Kimura [Japan]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2006.
French descriptors
- Pascal (Inist)
- Analyse contenu, Analyse documentaire, Caractère imprimé, Classification automatique, Document imprimé, Langage naturel, Longueur mot, Mot, Numérisation, Recherche documentaire, Recherche information, Reconnaissance caractère, Reconnaissance forme, Reconnaissance optique caractère, Structure document, Texte intégral, Traitement image, Variance.
- Wicri:
- topic: Numérisation, Recherche documentaire.
English descriptors
- KwdEn :
- Automatic classification, Character recognition, Content analysis, Digitizing, Document analysis, Document retrieval, Document structure, Full text, Image processing, Information retrieval, Natural language, Optical character recognition, Pattern recognition, Printed character, Printed document, Variance, Word, Word length.
Abstract
Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated texts contain errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR texts by absolute word frequency has limitations, such as dependency on text length and word recognition rate, which lower classification performance through higher within-class variances. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
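The length dependency described in the abstract can be illustrated with a small sketch. The abstract does not name the transformations used, so the relative-frequency normalization below is an assumption chosen for illustration, not the authors' method. The function names, the toy vocabulary, and the toy documents are all hypothetical.

```python
from collections import Counter
import math


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the number of recognized (in-vocabulary) words.

    Assumption: normalizing by recognized words is one plausible way to
    reduce the dependency on text length and word recognition rate.
    """
    vector = absolute_frequency(tokens, vocabulary)
    total = sum(vector)
    if total == 0:
        return vector  # no recognized words: leave the zero vector as is
    return [value / total for value in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: two documents with the same word mix,
# one five times longer than the other.
vocabulary = ["ocr", "text", "image"]
short_doc = ["ocr", "text", "ocr", "image"]
long_doc = short_doc * 5

abs_gap = euclidean(absolute_frequency(short_doc, vocabulary),
                    absolute_frequency(long_doc, vocabulary))
rel_gap = euclidean(relative_frequency(short_doc, vocabulary),
                    relative_frequency(long_doc, vocabulary))

print(f"absolute-frequency distance: {abs_gap:.2f}")
print(f"relative-frequency distance: {rel_gap:.2f}")
```

With absolute counts the two documents lie far apart even though their word mix is identical, which is the kind of within-class spread the abstract attributes to this representation. After normalization the two vectors coincide.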
Url:
DOI: 10.1007/11669487_45
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000060
- to stream Istex, to step Curation: 000059
- to stream Istex, to step Checkpoint: 000970
- to stream Main, to step Merge: 001014
- to stream PascalFrancis, to step Corpus: 000299
- to stream PascalFrancis, to step Curation: 000485
- to stream PascalFrancis, to step Checkpoint: 000303
- to stream Main, to step Merge: 001164
- to stream Main, to step Curation: 000F97
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification</title>
<author><name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</author>
<author><name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
</author>
<author><name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
</author>
<author><name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
</author>
<author><name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:242F41C85B44E90694E34C3FA935F14E48BF6255</idno>
<date when="2006" year="2006">2006</date>
<idno type="doi">10.1007/11669487_45</idno>
<idno type="url">https://api.istex.fr/document/242F41C85B44E90694E34C3FA935F14E48BF6255/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000060</idno>
<idno type="wicri:Area/Istex/Curation">000059</idno>
<idno type="wicri:Area/Istex/Checkpoint">000970</idno>
<idno type="wicri:doubleKey">0302-9743:2006:Murata M:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001014</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:08-0029071</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000299</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000485</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000303</idno>
<idno type="wicri:doubleKey">0302-9743:2006:Murata M:the:impact:of</idno>
<idno type="wicri:Area/Main/Merge">001164</idno>
<idno type="wicri:Area/Main/Curation">000F97</idno>
<idno type="wicri:Area/Main/Exploration">000F97</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification</title>
<author><name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
<author><name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, 5148507, Mie</wicri:regionArea>
<wicri:noRegion>Mie</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2006</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">242F41C85B44E90694E34C3FA935F14E48BF6255</idno>
<idno type="DOI">10.1007/11669487_45</idno>
<idno type="ChapterID">45</idno>
<idno type="ChapterID">Chap45</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic classification</term>
<term>Character recognition</term>
<term>Content analysis</term>
<term>Digitizing</term>
<term>Document analysis</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Full text</term>
<term>Image processing</term>
<term>Information retrieval</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Printed character</term>
<term>Printed document</term>
<term>Variance</term>
<term>Word</term>
<term>Word length</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Analyse contenu</term>
<term>Analyse documentaire</term>
<term>Caractère imprimé</term>
<term>Classification automatique</term>
<term>Document imprimé</term>
<term>Langage naturel</term>
<term>Longueur mot</term>
<term>Mot</term>
<term>Numérisation</term>
<term>Recherche documentaire</term>
<term>Recherche information</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance optique caractère</term>
<term>Structure document</term>
<term>Texte intégral</term>
<term>Traitement image</term>
<term>Variance</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Numérisation</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, OCR-generated texts contain errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR texts by absolute word frequency has limitations, such as dependency on text length and word recognition rate, which lower classification performance through higher within-class variances. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
</country>
</list>
<tree><country name="Japon"><noRegion><name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
</noRegion>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<name sortKey="Busagala, P" sort="Busagala, P" uniqKey="Busagala P" first="P." last="Busagala">P. Busagala</name>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<name sortKey="Kimura, Fumitaka" sort="Kimura, Fumitaka" uniqKey="Kimura F" first="Fumitaka" last="Kimura">Fumitaka Kimura</name>
<name sortKey="Murata, Mayo" sort="Murata, Mayo" uniqKey="Murata M" first="Mayo" last="Murata">Mayo Murata</name>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<name sortKey="Ohyama, Wataru" sort="Ohyama, Wataru" uniqKey="Ohyama W" first="Wataru" last="Ohyama">Wataru Ohyama</name>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
<name sortKey="Wakabayashi, Tetsushi" sort="Wakabayashi, Tetsushi" uniqKey="Wakabayashi T" first="Tetsushi" last="Wakabayashi">Tetsushi Wakabayashi</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F97 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F97 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:242F41C85B44E90694E34C3FA935F14E48BF6255 |texte= The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification }}
This area was generated with Dilib version V0.6.32.